Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information
Authors
Abstract
In deduplication-based backup systems, the chunks of each backup are physically scattered after deduplication, which causes a challenging fragmentation problem. Fragmentation decreases restore performance and results in invalid chunks becoming physically scattered across different containers after users delete backups. Existing solutions attempt to rewrite duplicate but fragmented chunks to improve restore performance, and reclaim invalid chunks by identifying and merging valid but fragmented chunks into new containers. However, they cannot accurately identify fragmented chunks due to their limited rewrite buffer. Moreover, the identification of valid chunks is cumbersome and the merging operation is the most time-consuming phase of garbage collection. Our key observation that fragmented chunks remain fragmented in subsequent backups motivates us to propose a History-Aware Rewriting algorithm (HAR). HAR exploits historical information of backup systems to more accurately identify and rewrite fragmented chunks. Since the valid chunks are aggregated in compact containers by HAR, the merging operation is no longer required. To reduce the metadata overhead of garbage collection, we further propose a Container-Marker Algorithm (CMA) that identifies valid containers instead of valid chunks. Our extensive experimental results on real-world datasets show that HAR significantly improves restore performance by 2.6X–17X at a cost of rewriting only 0.45–1.99% of the data, and that CMA reduces the metadata overhead of garbage collection by about 90X.
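The abstract gives no pseudocode, so the following minimal Python sketch only illustrates the general idea as described above: containers that a backup used poorly are remembered as sparse, duplicate chunks that still point into them are rewritten during the next backup, and garbage collection then only tracks which containers are referenced rather than which chunks are valid. The container size, utilization threshold, and the `store`/`fingerprint_index` interfaces are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch (not the authors' implementation) of history-aware
# rewriting (HAR) and container-level marking (CMA). Names, thresholds,
# and interfaces are assumptions.

UTIL_THRESHOLD = 0.5              # assumed cutoff below which a container is "sparse"
CONTAINER_SIZE = 4 * 1024 * 1024  # assumed 4 MiB containers


def har_backup(chunks, fingerprint_index, sparse_from_last_backup, store):
    """Deduplicate one backup, rewriting duplicates stranded in sparse containers."""
    utilization = {}   # container_id -> bytes this backup references in it
    recipe = []        # ordered (fingerprint, container_id) pairs for restore

    for chunk in chunks:
        cid = fingerprint_index.get(chunk.fp)
        if cid is None or cid in sparse_from_last_backup:
            # New chunk, or a duplicate living in a container flagged as
            # sparse by the previous backup: write it into the open container.
            cid = store.append(chunk)
            fingerprint_index[chunk.fp] = cid
        utilization[cid] = utilization.get(cid, 0) + chunk.size
        recipe.append((chunk.fp, cid))

    # Containers this backup used poorly become the "sparse" set that the
    # next backup consults: the historical information HAR exploits.
    sparse_for_next_backup = {
        cid for cid, used in utilization.items()
        if used / CONTAINER_SIZE < UTIL_THRESHOLD
    }
    return recipe, sparse_for_next_backup


def cma_mark(live_recipes):
    """Container-Marker sketch: garbage collection keeps only the set of
    container ids referenced by live backups, instead of per-chunk liveness."""
    return {cid for recipe in live_recipes for _fp, cid in recipe}
```

The sparse set returned for one backup is fed back in as `sparse_from_last_backup` for the next, which is what makes the rewriting decision history-aware in this sketch.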
Similar Resources
An Optimization of Backup Storage using Backup History and Cache Knowledge in reducing Data Fragmentation for In_line deduplication in Distributed
The chunks of data generated by a backup become physically scattered after deduplication in a backup system, which creates a problem known as fragmentation. Fragmentation mainly takes the form of sparse and out-of-order containers. Sparse containers adversely affect restore performance and garbage collection, while out-of-order containers...
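Because this snippet only names sparse and out-of-order containers without defining them, here is a rough, hypothetical way to tell them apart from a backup recipe; the recipe representation and both cutoffs are our own assumptions.

```python
# Hypothetical classifier: given a backup recipe as an ordered list of
# (chunk_size, container_id) pairs, flag sparse and out-of-order containers.

CONTAINER_SIZE = 4 * 1024 * 1024   # assumed container size
SPARSE_CUTOFF = 0.25               # assumed: <25% of the container still referenced
SPREAD_CUTOFF = 1000               # assumed: references spread over >1000 recipe slots


def classify_containers(recipe):
    referenced = {}   # container_id -> bytes still referenced by this backup
    first_ref = {}    # container_id -> first position in the recipe
    last_ref = {}     # container_id -> last position in the recipe

    for pos, (size, cid) in enumerate(recipe):
        referenced[cid] = referenced.get(cid, 0) + size
        first_ref.setdefault(cid, pos)
        last_ref[cid] = pos

    sparse = {c for c, used in referenced.items()
              if used / CONTAINER_SIZE < SPARSE_CUTOFF}
    out_of_order = {c for c in referenced
                    if c not in sparse
                    and last_ref[c] - first_ref[c] > SPREAD_CUTOFF}
    return sparse, out_of_order
```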
A Cost-efficient Rewriting Scheme to Improve Restore Performance in Deduplication Systems
In chunk-based deduplication systems, logically consecutive chunks are physically scattered across different containers after deduplication, which results in a serious fragmentation problem. Fragmentation significantly reduces restore performance because the scattered chunks must be read from many different containers. Existing work aims to rewrite the fragmented duplicate chunks into new containers...
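To make "reading the scattered chunks from different containers" measurable, restore cost is often estimated by replaying the recipe against a small container cache and counting container reads per unit of restored data. The sketch below is such a generic estimator, not the rewriting scheme this paper proposes; the cache capacity is an assumed value.

```python
from collections import OrderedDict

CACHE_CONTAINERS = 64   # assumed LRU cache capacity, in containers


def containers_read_per_mib(recipe):
    """Replay a recipe of (chunk_size, container_id) pairs through an LRU
    container cache and return container reads per MiB restored (lower is better)."""
    cache = OrderedDict()   # LRU over container ids
    reads = 0
    restored = 0

    for size, cid in recipe:
        restored += size
        if cid in cache:
            cache.move_to_end(cid)          # cache hit: refresh recency
        else:
            reads += 1                      # miss: the whole container is fetched
            cache[cid] = True
            if len(cache) > CACHE_CONTAINERS:
                cache.popitem(last=False)   # evict the least recently used container
    return reads / max(restored / (1024 * 1024), 1e-9)
```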
A Network Differential Backup and Restore System based on a Novel Duplicate Data Detection algorithm
The ever-growing volume and value of data have raised increasing pressure for long-term data protection in storage systems. Moreover, the redundancy in data further aggravates this pressure. Protecting data while eliminating redundancy, and thereby saving storage space and network bandwidth, has become a serious challenge. Data deduplication techniques greatly optimize storage ...
Improving restore speed for backup systems that use inline chunk-based deduplication
Slow restoration due to chunk fragmentation is a serious problem facing inline chunk-based data deduplication systems: restore speeds for the most recent backup can drop by orders of magnitude over the lifetime of a system. We study three techniques for alleviating this problem: increasing cache size, container capping, and using a forward assembly area. Container capping is an ingest-time operation...
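As we understand the forward assembly area technique, the restorer reads ahead in the backup recipe, reserves a fixed-size output buffer, and fills all slots served by a given container in one visit, so each container is read at most once per buffer's worth of data. A minimal sketch under that reading; the buffer size and the `store.read_chunk` accessor are assumptions, not details from the paper.

```python
FAA_BYTES = 128 * 1024 * 1024   # assumed size of the forward assembly area


def restore_with_faa(recipe, store, out):
    """Restore a backup by filling a fixed-size assembly buffer span by span.

    `recipe` is an ordered list of (fingerprint, chunk_size, container_id) and
    `store.read_chunk(container_id, fingerprint)` is an assumed accessor that
    returns the chunk bytes.
    """
    i = 0
    while i < len(recipe):
        # 1) Plan one assembly area's worth of chunks, grouped by container.
        plan = {}          # container_id -> list of (buffer_offset, fingerprint, size)
        filled, offset = 0, 0
        j = i
        while j < len(recipe) and (j == i or filled + recipe[j][1] <= FAA_BYTES):
            fp, size, cid = recipe[j]
            plan.setdefault(cid, []).append((offset, fp, size))
            offset += size
            filled += size
            j += 1

        # 2) Visit each needed container once, copying its chunks into place.
        buf = bytearray(filled)
        for cid, slots in plan.items():
            for off, fp, size in slots:
                buf[off:off + size] = store.read_chunk(cid, fp)

        # 3) Flush the assembled span and advance to the next one.
        out.write(bytes(buf))
        i = j
```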
Similarity Based Deduplication with Small Data Chunks
Large backup and restore systems may hold a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store fingerprints for each data chunk, detecting identical chunks with a very low probability of collisions...
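A hedged sketch of the hashing approach described above: chunk the stream, fingerprint each chunk, and store only first occurrences. SHA-1 fingerprints, fixed-size chunking, and the `index`/`store` interfaces are illustrative choices here, not details taken from that paper.

```python
import hashlib

CHUNK_SIZE = 8 * 1024   # assumed fixed-size chunking, for simplicity


def dedup_stream(stream, index, store):
    """Split a byte stream into chunks, store each unique chunk once, and
    return a recipe of fingerprints from which the stream can be rebuilt.

    `index` maps fingerprint -> storage handle; `store.put(fp, chunk)` is an
    assumed method that persists the chunk and returns its handle.
    """
    recipe = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        fp = hashlib.sha1(chunk).hexdigest()   # fingerprint of the chunk
        if fp not in index:                    # first occurrence: persist it
            index[fp] = store.put(fp, chunk)
        recipe.append(fp)
    return recipe
```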